ERROR TRACKING METHOD AND SYSTEM 



BACKGROUND 

1. Technical Field 

[0001] The present invention relates to identifying defective components in a
computing system. More particularly, the invention concerns storing information
concerning the paths of data objects in a computing system to facilitate identifying
defective components.

2. Description of Related Art

[0002] Important data is frequently stored in computing systems. If a data 
object becomes corrupted, it is desirable to be able to quickly identify the cause of the 
data corruption, so the problem can be eliminated. A problem may be eliminated, for 
example, by replacing a defective component. Quickly identifying and replacing
defective components can limit the amount of corrupted data and associated costs. The
task of identifying the cause of data corruption is particularly challenging in computing 
systems that utilize a large number of storage devices, and which have a large number of 
paths over which data objects may travel, such as when a storage area network (SAN) is 
utilized. 

[0003] Utilizing a storage area network increases the complexity of a computing
system. When a data object is stored in a storage area network, the data path is rarely a
simple point-to-point transfer, and instead may involve multiple interfaces and devices.
Consequently, a data object may travel over any of a number of paths between a source
and a destination when being stored. In this case, when an error is detected in stored
data, it is often difficult to determine the cause of the error. Computing environments
with directly attached storage also suffer from similar problems.

[0004] One traditional method of error tracking involves examining information 
that devices in a computing system provide for diagnosing problems. For example, 
device logs may be examined to try to identify one or more devices that have experienced
an error, and to try to identify the type of error that has occurred. Server error reports
may also be examined. One problem with this technique is that an error investigation
may take place after the error has been flushed from the relevant error logs, and
consequently the error cannot be determined. Also, even if a device that has experienced
an error is identified, it is often difficult to determine whether the path of a particular data
object included the device that experienced the error.

[0005] It is difficult and time consuming to examine every device in a storage
area network or large network of locally attached devices. The number of devices
connected to a storage area network may be large, and the number of possible
connections between devices increases exponentially as the number of storage devices
increases. Consequently, with large storage area networks, examining error logs and
determining when or if a device handled a data object is a daunting, if not impossible,
task. The difficulty is compounded in heterogeneous computing environments.
Frequently, storage management software is erroneously blamed for data errors.

[0006] Cyclical Redundancy Checking (CRC) is another known technique for
error detection. However, the usefulness of utilizing CRC checking for error detection is
limited because many computing environments cannot tolerate the performance cost of
CRC techniques at each transfer. Additionally, CRC checking facilitates identifying only
a limited set of devices in the data path of a data object for further investigation.

[0007] In another known technique for error tracking, devices in a storage area
network are relied upon to report data transfer errors to a storage manager server, so the
server can notify a client to retry an operation if necessary. However, the success of this
technique is dependent on errors being reliably reported to the server, which often does
not occur. For example, a defective device itself may not detect an error, and therefore
will not make an entry in its error log, and will not report the error to the server or a
calling application. Examples of undetected errors that devices may fail to report include
flipped bits and the failure to store a file.

[0008] In addition to the factors discussed above, traditional error tracking 
methods are often inadequate when data storage errors are intermittent and are associated
with individual data objects, which is frequently the case. In summary, known error
tracking techniques are generally inadequate for quickly and accurately identifying
malfunctioning components in a computing system.

SUMMARY

[0009] One aspect of the invention is a method for error tracking. An example of 
the method includes sending a data object to a first location. This example of the method 
also includes determining if the data object was successfully stored at the first location, 
and if so, storing meta data corresponding with the data object, wherein the meta data
includes first path information. This example of the method also includes sending the
data object to a second location. This example of the method further includes
determining if the data object was successfully stored at the second location, and if so,
adding second path information to the meta data corresponding with the data object, to
update the meta data. A direct effect of some examples of this method is the
establishment of a history of devices that have handled a data object.

[0010] Other aspects of the invention are described in the sections below, and 
include, for example, a computing system, and a signal bearing medium tangibly 
embodying a program of machine-readable instructions executable by a digital processing
apparatus to perform a method for error tracking.

[0011] The invention provides a number of advantages. For example, some 
examples of the invention permit quickly identifying defective components that have 
caused data errors in a computing system. Additionally, some examples of the invention 
provide the benefit of identifying data that may have been corrupted by defective
components. The invention also provides a number of other advantages and benefits,
which should be apparent from the following description.

BRIEF DESCRIPTION OF THE DRAWINGS
[0012] FIG. 1 is a block diagram of the hardware components and 
interconnections of a computing system in accordance with an example of the invention. 
[0013] FIG. 2 is a block diagram of the hardware components and
interconnections of a computing system in accordance with another example of the
invention.

[0014] FIG. 3 is an example of a signal-bearing medium in accordance with an
example of the invention.

[0015] FIGS. 4A and 4B are a flowchart of an operational sequence for error
tracking in accordance with an example of the invention.



IBM Docket No. SJO920030042US1 






DETAILED DESCRIPTION 
[0016] The nature, objectives, and advantages of the invention will become 
more apparent to those skilled in the art after considering the following detailed 
description in connection with the accompanying drawings. 

I. HARDWARE COMPONENTS AND INTERCONNECTIONS
[0017] One aspect of the invention is a computing system that can be used for 
error tracking. As an example, the computing system may be embodied by all, or 
portions of, the computing system 100 shown in FIG. 1. The computing system 100 
includes a client computer 102. More than one client 102 could be included. The client
102 may run any suitable operating system, for example, Windows 2000, AIX, Solaris™, 
Linux, UNIX, or HP-UX™. The client 102 may be implemented on any suitable 
computing device, for example a personal computer, a workstation, a mainframe 
computer, or a supercomputer. 

[0018] The client 102 has a processor 104 (which may be called a processing 
device), and in some examples could have more than one processor. As an example, the 
processor 104 may be a PowerPC RISC processor, available from International Business
Machines Corporation. In another example, the processor could be a processor
manufactured by Intel Corporation. The client 102 also may have a storage that may
include a primary memory 106 (for example RAM), and/or a non-volatile memory 108, 
both of which are coupled to the processor 104. The storage may be used to store data 
and application programs and/or other programming instructions executed by the 
processor 104. The application programs could generally be any suitable applications. 
The non-volatile memory 108 could be, for example, a hard disk drive, a drive for 
reading and writing from optical or magneto-optical media, a tape drive, non-volatile 
RAM (NVRAM), or any other suitable type of storage. In some examples the primary 
memory 106 or the non-volatile memory 108 could be eliminated, or the primary memory 
106 and/or the non-volatile memory 108 could be provided on the processor 104, or
alternatively, external from the client 102.

[0019] The computing system 100 also includes a server computer 110. More
than one server 110 may be included. As an example, the server 110 could be a personal 
computer (having for example, an Intel processor running the Windows or Linux 
operating system), a computer workstation, a mainframe computer, a supercomputer (for 
example an IBM SP2 running the AIX operating system), or any other suitable
computing device running any suitable operating system. As another example, the server
110 could be an IBM xSeries server. In another example, the server 110 could be an
IBM zSeries Parallel Sysplex server, such as a zSeries 900, running the z Operating
System (z/OS). In another example, the server 110 could be an IBM S/390 server
running OS/390.

[0020] The server 110 has a processor 112 (which may be called a processing
device), and in some examples could have more than one processor. As an example, the
processor 112 may be a PowerPC RISC processor, available from International Business
Machines Corporation. In another example, the processor could be a processor
manufactured by Intel Corporation. In one specific example, the server 110 could use an
IBM P690 power 4 processor, running the AIX operating system and Tivoli Storage
Manager (TSM), all of which are available from IBM Corporation. The server may also
have a storage, which may include a primary memory 114 (for example RAM), and/or a
non-volatile memory 116, both of which are coupled to the processor 112. The storage
may be used to store data and application programs and/or other programming
instructions executed by the processor 112. The programs could generally be any
suitable programs, and could include for example, DB2, Oracle, or custom applications.
The non-volatile memory 116 could be, for example, a hard disk drive, a drive for
reading and writing from optical or magneto-optical media, a tape drive, non-volatile
RAM (NVRAM), or any other suitable type of storage. In some examples the primary
memory 114 or the non-volatile memory 116 could be eliminated, or the primary memory
114 and/or the non-volatile memory 116 could be provided on the processor 112, or
alternatively, external from the server 110. The server 110 is coupled to the client 102
over a connection 118, which may be for example, an ESCON or ETHERNET
connection, and as an example, may be an IP network.

[0021] The server 110 may manage data backup operations, and store file 
system meta-data, for example, data object path information that may include dates and 
times that a data object is stored, source identifiers, and destination identifiers. The
meta-data may also include a file creation time and file security information.

[0022] The computing system 100 also includes at least one storage device. As
an example, a plurality of storage devices may be directly coupled to the server 110 as
shown in FIG. 1. The plurality of storage devices may include, for example, hard drives
120a-c, and tape drives 122a-c. Each of the storage devices in the plurality of storage
devices could be implemented with any suitable type of storage device, using any suitable
technology, such as magnetic, optical, magneto-optical, or electrical. For example,
suitable storage devices could include hard disk drives, optical disks or discs (for
example, CD-RW, DVD-RW, or DVD+RW), floppy disks, magnetic data storage disks
or diskettes, magnetic tape, digital optical tape, a tape library, EPROMs, EEPROMs,
RAM, Non-Volatile RAM, and flash memory. Additionally, CD-R, WORM, DVD-R,
and/or DVD+R devices could be included in the plurality of storage devices. As an
example, the disk storage may be implemented with an IBM FAStT 900, and the tape library
may be implemented with an IBM 3494 tape library using IBM 3590 drives, all of which
are available from IBM Corporation. In another example, one or more of the storage
devices could be implemented with a model 2105 Enterprise Storage Server, available
from IBM Corporation. Each storage device 120a-c, 122a-c may be coupled to the server
110 using, for example, Fibre Channel or SCSI protocol.

[0023] Referring to FIG. 2, the computing system may alternatively be
embodied by all, or portions of, the computing system 200. As in the computing system
100 in FIG. 1, the computing system 200 in FIG. 2 includes the client 102, the server 110,
and the connection 118. In contrast to the computing system 100 in FIG. 1, in the
computing system 200 in FIG. 2 the plurality of storage devices 206a-d, 208a-e are
coupled to the server 110 through a storage area network (SAN) 204. The plurality of
storage devices may include, for example, hard drives 206a-d, and tape drives 208a-e,
and may include any of the types of storage devices discussed above with regard to the
computing system 100 in FIG. 1. Using the storage area network 204 permits coupling a
large number of storage devices to the server 110. As an example, the storage area
network 204 may be implemented using Fibre Channel network(s) and/or Ethernet
network(s). The invention need not be implemented with a storage area network, and
may be implemented with directly attached storage, as shown in FIG. 1, wherein the
plurality of storage devices are coupled directly to the server 110.

[0024] In another example, the computing system could be implemented with a 
storage area network and the architecture for the Storage Tank™ file system and storage
management system (not shown), which is available from IBM Corporation. 



II. OPERATION

[0025] In addition to the hardware embodiments described above, another 
aspect of the invention concerns a method for error tracking.

A. Signal-Bearing Media 
[0026] In the context of FIGS. 1 and 2, the method aspect of the invention may
be implemented, for example, by having the server 110 execute a sequence of
machine-readable instructions, which can also be referred to as code. These instructions may
reside in various types of signal-bearing media. In this respect, one aspect of the present
invention concerns a programmed product, comprising a signal-bearing medium or
signal-bearing media tangibly embodying a program of machine-readable instructions
executable by a digital processing apparatus to perform a method for error tracking.

[0027] This signal-bearing medium may comprise, for example, primary
memory 114 and/or non-volatile memory 116. Alternatively, the instructions may be
embodied in a signal-bearing medium such as the optical data storage disc 300 shown in
FIG. 3. The optical disc can be any type of signal bearing disc or disk, for example, a
CD-ROM, CD-R, CD-RW, WORM, DVD-R, DVD+R, DVD-RW, or DVD+RW.
Additionally, whether contained in the computing system 100, the computing system 200,
or elsewhere, the instructions may be stored on any of a variety of machine-readable data
storage mediums or media, which may include, for example, a "hard drive", a RAID
array, a RAMAC, a magnetic data storage diskette (such as a floppy disk), magnetic tape,
digital optical tape, RAM, ROM, EPROM, EEPROM, flash memory, magneto-optical
storage, paper punch cards, or any other suitable signal-bearing media including
transmission media such as digital and/or analog communications links, which may be
electrical, optical, and/or wireless. As an example, the machine-readable instructions
may comprise software object code, compiled from a language such as "C++".


B. Overall Sequence of Operation 
[0028] For ease of explanation, but without any intended limitation, the method
aspect of the invention is described with reference to the computing system 100 described
above and shown in FIG. 1, and the computing system 200 described above and shown in
FIG. 2. An example of the method aspect of the present invention is illustrated in FIGS.
4A and 4B, which show a sequence 400 for a method for error tracking. As an example,
the operations in the sequence 400 may be performed by the server 110 in conjunction
with the storage devices 206a-d, 208a-e. Referring to FIG. 4A, the sequence 400 may
include, and begin with, operation 402 which comprises sending a data object to a first
location. Operation 402 may comprise attempting to store the data object at the first
location. As an example, the first location may be in a pool of disks coupled to a server
such as the server 110. In another example, the first location may be in any storage
device 206a-d, 208a-e in the storage area network 204. As a specific example, the data
object may be stored in one of a Tivoli Storage Manager server's disks. The sequence
may also include operation 404 which comprises determining if the data object was
successfully stored at the first location. If in operation 404 it is determined that the data
object was not successfully stored at the first location, then operation 402 may be
repeated. If it is determined in operation 404 that the data object was successfully stored
at the first location, the sequence 400 may also include operation 406, which comprises
storing meta data corresponding with the data object, wherein the meta data includes first
path information. Although not required, this meta data is typically stored in a database.
As an example, the meta data may be stored in the non-volatile memory 116. The first
path information may include a date and time that the data object was stored (or, for
example, a time identifier, a time code, a time stamp, or other number or identifier). The
first path information may also include a source identifier and/or a destination identifier.
The first path information may also include information that identifies hardware devices
that are in the data path of the data object between the source and the destination, for
example, intermediary host bus adapters (HBAs), routers, and switches. A host bus
adapter functions as a network adapter for networks, and may translate, for example,
SCSI or Fibre Channel.

[0029] Thus, some examples of the invention may identify the date and time, 
source, target devices, and/or the data path of a data object when a data object is 
transferred, and then store that information as part of meta data for the object. Each time
the data object is transferred, the new time and device information may be updated and
appended to the meta data. Tracking the complete history provides flexibility and
usefulness, and possibly could be utilized in other computing environments.
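
As a purely illustrative sketch, and without any intended limitation, the store, verify, and record steps of operations 402 through 412 might be expressed as follows. The names PathRecord, ObjectMetadata, and transfer_and_record, the store callback, and the max_attempts bound are hypothetical and are not part of the described system; the description states only that a failed store may be repeated.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class PathRecord:
    """One entry of path information for a single transfer."""
    stored_at: datetime                  # date and time the object was stored
    source: str                          # source identifier
    destination: str                     # destination identifier
    devices: List[str] = field(default_factory=list)  # e.g. HBAs, routers, switches


@dataclass
class ObjectMetadata:
    """Meta data kept for one data object, including its path history."""
    object_id: str
    history: List[PathRecord] = field(default_factory=list)


def transfer_and_record(meta: ObjectMetadata, source: str, destination: str,
                        devices: List[str], store: Callable[[str], bool],
                        max_attempts: int = 3) -> bool:
    """Try to store the object; append path information only on success.

    `store` is a hypothetical callback that returns True when the object
    was successfully stored at `destination`. `max_attempts` is an assumed
    bound on the retries.
    """
    for _ in range(max_attempts):
        if store(destination):
            meta.history.append(
                PathRecord(datetime.now(timezone.utc), source, destination, list(devices))
            )
            return True
    return False
```

In this sketch each successful transfer appends one record, so the history list plays the role of the accumulated path information for the object. A failed transfer leaves the meta data unchanged.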

[0030] The sequence 400 may also include operation 408, which comprises
sending the data object to a second location. As an example, the data object may be sent
from the first location to the second location. Operation 408 may comprise attempting to
store the data object at the second location. As an example, the second location may be
on any storage device 206a-d, 208a-e in the storage area network 204. The operation 408
of sending the data object to the second location may comprise attempting to migrate the
data object to tape, which, as an example, may be the case if the data object was first
stored on a hard disk. The sequence 400 may also include operation 410, which
comprises determining if the data object was successfully stored at the second location.
If in operation 410 it is determined that the data object was not successfully stored at the
second location, then operation 408 may be repeated. If in operation 410 it is determined
that the data object was successfully stored at the second location, then the sequence 400
may also include operation 412, which comprises adding (which may comprise
appending or pre-pending) second path information to the meta data corresponding with
the data object, to update the meta data. The second path information may include a date
and time that the data object was stored at the second location (or, for example, a time
code or other number). The second path information may also include a source identifier
and/or a destination identifier. As an example, the source may be the first location. The
second path information may also include information that identifies hardware devices
that are in the data path of the data object between the source and the destination, for
example, intermediary host bus adapters (HBAs), routers, and switches.

[0031] The sequence 400 may also include operation 414, which comprises
sending the data object to a third location. Operation 414 may comprise attempting to
store the data object at the third location. As an example, the third location may be on
any storage device 206a-d, 208a-e in the storage area network 204. The sequence 400
may also include operation 416, which comprises determining if the data object was
successfully stored at the third location. If in operation 416 it is determined that the data
object was not successfully stored at the third location, then operation 414 may be
repeated. If in operation 416 it is determined that the data object was successfully stored
at the third location, then the sequence 400 may also include operation 418, which
comprises adding third path information to the meta data corresponding with the data
object, to update the meta data. The operations of attempting to store the data object at
another location, and determining whether the data object was successfully stored, and
adding additional path information to the meta data corresponding with the data object,
may be repeated as many times as desired. The third path information may include a date
and time that the data object was stored at the third location (or, for example, a time code
or other number). The third path information may also include a source identifier and/or
a destination identifier. As an example, the source may be the second location. The third
path information may also include information that identifies hardware devices that are in
the data path of the data object between the source and the destination, such as for
example intermediary host bus adapters (HBAs), routers, and switches.

[0032] In an alternative example, the sequence 400 may comprise sending a
data object to a plurality of locations at substantially the same time, for example, to a first
location and a second location at substantially the same time. This may be described as
simultaneously writing a data object to a plurality of devices. The sequence 400 may also
comprise determining if the data object was successfully stored at the first location and
the second location, and if so, storing in a single meta data record, meta data
corresponding with the data object, wherein the meta data comprises first path
information including a first source identifier, a first destination identifier, and a second
destination identifier. Alternatively, the sequence 400 may comprise determining if the
data object was successfully stored at the first location, and if so, storing in a single meta
data record, meta data corresponding with the data object, wherein the meta data includes
first path information comprising a first source identifier and a first destination identifier,
and, determining if the data object was successfully stored at the second location, and if
so, also storing in the single meta data record, additional meta data corresponding with
the data object, wherein the additional meta data includes additional first path
information comprising a second destination identifier. As an example, the first path
information may further comprise a first time identifier. The sequence 400 may also
comprise sending the data object from the first location to a third location. The sequence
400 may also comprise determining if the data object was successfully stored at the third
location, and if so, adding second path information to the meta data corresponding with
the data object, to update the meta data. As an example, the second path information may
comprise a second source identifier, a third destination identifier, and a second time
identifier.
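
As a hedged illustration of the simultaneous-write alternative, the sketch below builds a single meta data record that holds one source identifier, one time identifier, and a destination identifier for each location at which the store succeeded. The class name FanOutRecord, the function record_simultaneous_write, and the representation of store outcomes as a dictionary of booleans are assumptions made only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FanOutRecord:
    """Single meta data record for one simultaneous write."""
    source: str                                   # first source identifier
    time_id: str                                  # first time identifier
    destinations: List[str] = field(default_factory=list)


def record_simultaneous_write(source: str, time_id: str,
                              results: Dict[str, bool]) -> FanOutRecord:
    """Add a destination identifier for every location that stored the object.

    `results` maps a destination identifier to whether the store at that
    destination succeeded (a hypothetical representation).
    """
    record = FanOutRecord(source=source, time_id=time_id)
    for destination, stored_ok in results.items():
        if stored_ok:
            record.destinations.append(destination)
    return record
```

A later transfer from one of these destinations to a further location would then add separate second path information, as described above, rather than extend this record.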

[0033] In another alternative example, the sequence 400 may comprise sending
a data object to a first location, a second location, and a third location at substantially the
same time. As an example, the first location may be in a disk array, the second location
may be a first tape library and the third location may be a second tape library. The
sequence 400 may also comprise determining if the data object was successfully stored at
the first location, and if so, storing in a single meta data record, meta data corresponding
with the data object, wherein the meta data includes first path information comprising a
first source identifier and a first destination identifier. The sequence 400 may further
comprise determining if the data object was successfully stored at the second location,
and if so, also storing in the single meta data record, first additional meta data
corresponding with the data object, wherein the first additional meta data includes
additional first path information comprising a second destination identifier. The
sequence 400 may additionally comprise determining if the data object was successfully
stored at the third location, and if so, also storing in the single meta data record, second
additional meta data corresponding with the data object, wherein the second additional
meta data includes additional first path information comprising a third destination
identifier. The sequence 400 may also comprise sending the data object from the first
location to a fourth location. The sequence 400 may further comprise determining if the
data object was successfully stored at the fourth location, and if so, adding second path
information to the meta data corresponding with the data object, to update the meta data.
As an example, the first path information may further comprise a first time identifier, and
the second path information may comprise a second source identifier, a fourth destination
identifier, and a second time identifier.

[0034] The sequence 400 may also include operation 420, which comprises
detecting that the data object is corrupted (which may also be described as
determining whether the data object is corrupted). In some cases the data corruption may
not be discovered until after attempting to restore the data object from a backup copy of
data, and error logs may be lost by the time the data error is discovered. As an example,
the operation of detecting that the data object is corrupted may comprise examining a
CRC base in the data, or could comprise using other data validation methods. A CRC
base may be embedded in the data. An error is present if the CRC base does not match.
After detecting that the data object is corrupted, the sequence may also include operations
422, 424, 426, 428, and 430, which are described below.
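
As one possible illustration of the detection in operation 420, the sketch below embeds a CRC-32 value in the stored bytes and reports an error when the recomputed value does not match. The layout (a 4-byte big-endian CRC-32 trailer appended to the payload) and the function names are assumptions for this example only; the description does not specify how the CRC base is embedded. The example uses the CRC-32 routine from the Python standard library.

```python
import zlib


def attach_crc(payload: bytes) -> bytes:
    """Return the payload followed by its CRC-32 as a 4-byte trailer (assumed layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_corrupted(stored: bytes) -> bool:
    """Recompute the CRC-32 of the payload and compare it with the embedded value."""
    if len(stored) < 4:
        return True  # too short to contain the assumed trailer
    payload, trailer = stored[:-4], stored[-4:]
    return zlib.crc32(payload) != int.from_bytes(trailer, "big")
```

A check of this kind shows only that the object is corrupted. It does not indicate which device caused the corruption, which is the role of the path information in the meta data.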

[0035] The sequence 400 may include operation 422, which comprises
searching for an uncorrupted copy of the corrupted data object. Device histories may be
compared to determine the degree of confidence in a copy of a data object. As an
example, if the device histories are the same, then a copy of the data object is suspect. If
the device histories are completely different, then there is a greater chance that the copy
of the data object is not corrupted. The sequence 400 may also include operation 424,
which comprises analyzing error logs for all devices included in the meta data for the
corrupted data object. The sequence 400 may also include operation 426, which
comprises identifying all devices included in the meta data for the corrupted data object,
which indicate an error in their error log. The sequence may also include operation 428,
which comprises searching meta data corresponding with an additional data object to
determine if any device in the meta data for the corrupted data object that has an error in
its error log is included in the meta data corresponding with the additional data object. If
in operation 428 it is determined that a device that has an error in its error log is included
in the meta data corresponding with the additional data object, then the sequence 400
may also include operation 430, which comprises analyzing the additional data object to
try to determine if the additional data object is corrupted.
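
Operations 422 through 430 can be sketched, under simplifying assumptions, with set operations over the device identifiers recorded in the meta data. In the example below, device_history maps each data object identifier to the set of devices in its meta data, and error_logs maps a device identifier to its list of logged errors. Both structures, and all function names, are hypothetical.

```python
from typing import Dict, List, Set


def shared_devices(history_a: Set[str], history_b: Set[str]) -> Set[str]:
    """Devices common to two copies; more overlap makes a copy more suspect."""
    return history_a & history_b


def suspect_devices(corrupted_id: str,
                    device_history: Dict[str, Set[str]],
                    error_logs: Dict[str, List[str]]) -> Set[str]:
    """Devices in the corrupted object's meta data that show an error in their log."""
    return {d for d in device_history[corrupted_id] if error_logs.get(d)}


def objects_to_audit(corrupted_id: str,
                     device_history: Dict[str, Set[str]],
                     error_logs: Dict[str, List[str]]) -> List[str]:
    """Other objects whose meta data includes at least one suspect device."""
    suspects = suspect_devices(corrupted_id, device_history, error_logs)
    return sorted(
        object_id
        for object_id, devices in device_history.items()
        if object_id != corrupted_id and devices & suspects
    )
```

The objects returned by objects_to_audit correspond to the additional data objects that operation 430 would then analyze for corruption.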

[0036] Prior to the invention, there was no easy way to identify which devices 
in a computing system handled which data objects. Consequently, it was very difficult to 
determine which device caused a data error. Generally, it is very difficult and time 
consuming to examine every device in a storage area network. To avoid this problem, 

20 some examples of the invention focus on the data path, and track which specific devices 
and hardware paths a data object passes through while it is being stored. With some 
examples of the invention, the path of a data object is tracked through the computing 
system, and information indicating which devices have "touched" the data object are 
stored in meta data. Some examples of the invention limit the search for defective 

25 devices, because the hardware paths, devices, and time that data is handled, is tracked, 
thereby providing a clearer trail to follow for problem determination and data recovery. 
If an unreported error occurs during the transfer of a data object, devices whose error logs 
are candidates for being examined can quickly be identified by using the information in 
the meta data. Thus, some examples of the invention facilitate quickly identifying failed 
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hardware and/or software components causing data corruption. The history of data 
objects may be tracked across an entire storage subsystem, from application to end point, 
thereby providing a systematic view of an operation, and devices used to perform the 
operation. In addition, because the time window of an error is known, it is possible to 
5 identify other data objects that may also have been adversely affected. This set of data 
objects may be audited to determine whether additional recovery is needed. With some 
examples of the invention, problem analysis is reduced to a fixed set of devices in a 
window of time, correlations can be better drawn if multiple files fail to restore, and 
further, predictive capability is gained for identifying other suspect data objects. The
invention may be called a data-centric error tracking and problem analysis method (and
system), because the error tracking is based on the path traveled by the data. 
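
The predictive step described in this paragraph can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory catalog that maps each data object name to the transfers recorded in its meta data; the names `Transfer` and `suspect_objects`, and the sample catalog contents, are illustrative assumptions and are not part of the disclosed system.

```python
from datetime import datetime
from typing import Dict, List, Set, Tuple

# Hypothetical record shape: (time of transfer, devices that handled the object).
Transfer = Tuple[datetime, Set[str]]


def suspect_objects(catalog: Dict[str, List[Transfer]], device: str,
                    start: datetime, end: datetime) -> List[str]:
    """Return names of objects that passed through `device` within [start, end]."""
    return sorted(
        name
        for name, transfers in catalog.items()
        if any(start <= when <= end and device in devices
               for when, devices in transfers)
    )


# Illustrative catalog; the object names and times are made up for this sketch.
catalog = {
    "FILE1": [(datetime(2003, 1, 3, 14, 5), {"DRIVE1"}),
              (datetime(2003, 1, 17, 10, 0), {"DRIVE1", "DRIVE2"})],
    "FILE2": [(datetime(2003, 1, 17, 10, 30), {"DRIVE2", "DRIVE4"})],
    "FILE3": [(datetime(2003, 2, 20, 9, 0), {"DRIVE2"})],
}

# Objects handled by DRIVE2 on January 17, 2003 are candidates for auditing.
to_audit = suspect_objects(catalog, "DRIVE2",
                           datetime(2003, 1, 17), datetime(2003, 1, 18))
print(to_audit)
```

In this sketch, FILE3 is not selected because its only transfer through DRIVE2 falls outside the window, which is how the known time window narrows the set of objects to audit.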

[0037] This method of maintaining the history for each creation and copy of a 
file provides a complete history of the data movement. If an error occurs while reading a 
file, the path history can be viewed for pinpointing the time periods to investigate. The
investigation can then focus on relevant device error logs, software or hardware changes,
and/or other storage area network events that could have happened in that window of 
time. In some examples, error logs are examined only for a window of time beginning 
when an uncorrupted version of a data object is last stored, and ending at the time that a 
corrupted version of the data object is first stored. After the source of the data corruption
is identified, proactive steps can be taken to disable defective devices and to evaluate
other data for corruption and to determine whether the corruption is pervasive or isolated
to a specific timing window. Other measures, for example CRC checking, can reduce the 
number of data transfers to investigate, but not all environments will be able to support 
those measures. Additionally, it is possible to use the complete history of the data
movement as a level of redundancy if the checking algorithms are not completely trusted.
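
The window of time described in this paragraph (from the last store of an uncorrupted version to the first store of a corrupted version) can be sketched in Python as follows. The function name `investigation_window` and the representation of each stored version as a `(timestamp, is_corrupted)` pair are assumptions made for illustration only; how corruption is actually detected is outside the scope of the sketch.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def investigation_window(
    versions: List[Tuple[datetime, bool]],
) -> Optional[Tuple[Optional[datetime], datetime]]:
    """Return (last good store time, first corrupt store time).

    Each item of `versions` is (time stored, is_corrupted). Returns None when
    no corrupted version exists. The start is None when no uncorrupted version
    precedes the first corrupted one, i.e. the window is open-ended at the start.
    """
    last_good: Optional[datetime] = None
    for when, corrupted in sorted(versions, key=lambda v: v[0]):
        if corrupted:
            return (last_good, when)
        last_good = when
    return None


# Illustrative history: two good stores followed by a corrupted one.
window = investigation_window([
    (datetime(2003, 1, 3, 14, 5), False),
    (datetime(2003, 1, 3, 15, 0), False),
    (datetime(2003, 1, 17, 10, 0), True),
])
print(window)
```

Error logs would then only need to be examined for entries falling between the two returned timestamps.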

[0038] As an example, the invention may be implemented in Tivoli Storage 
Manager, which is a backup data recovery product that is available from IBM 
Corporation. Tivoli Storage Manager includes a backup server and a library or database 
of meta data for data objects, and may be used with a storage area network. A Tivoli
Storage Manager server uses storage area networks to provide data storage services to its
clients for backup, archive, or space management of data. A Tivoli Storage Manager 
client relies on the Tivoli Storage Manager server to store files for safe keeping. 
Additionally, the invention is also applicable to SCSI environments, Fibre Channel
environments, and other data management products.

[0039] The following are examples of how some embodiments of the invention
may provide a history of data and its paths. Three separate cases are considered: initial 
creation, object copy for redundant backup, and object migration. Although in these 
examples only the time, source drive, and destination drive are tracked, the method may
be extended to include any hardware elements in the data path (such as HBAs, routers,
etc.), as long as that information is available to the software application.

[0040] (Initial case) Using Tivoli Storage Manager, a client stores FILE1 to a
tape drive, DRIVE1, on a tape, TAPE1, on January 3, 2003. If the store is successful,
meta-data about FILE1 is stored in the database. The following path information is
included in the meta-data for FILE1:

01/03/2003,14:05,DRIVE1,TAPE1.

[0041] (Copy case) Later on the same day, a copy of FILE1 is made for disaster
recovery purposes using DRIVE2 as the source drive and DRIVE3 as the target drive,
and tape COPYTAPE1. Meta-data for the copy of FILE1 is stored in the database along
20 with its path information as follows: 

01/03/2003,14:05,DRIVE1,TAPE1;01/03/2003,15:00,DRIVE2,DRIVE3,COPYTAPE1.
This meta data shows that on January 3, 2003 at 14:05 this object was created using
DRIVE1/TAPE1. Later on January 3, at 15:00, a copy was created and stored on
COPYTAPE1 using DRIVE2 as the source device and using DRIVE3 as the target
device.

[0042] (Migration case) Two weeks later, FILE1 is migrated (moved) from
TAPE1 to TAPE3 using DRIVE1 as the source, and DRIVE2 as the target. The meta-data
stored looks like this:

01/03/2003,14:05,DRIVE1,TAPE1;01/17/2003,10:00,DRIVE1,DRIVE2,TAPE3.
This meta data shows that the file was originally stored January 3rd on TAPE1 using
DRIVE1. Then on January 17th at 10:00 the file was moved to TAPE3 using DRIVE2,
reading from DRIVE1.
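
As an illustration of how path information in this form might be consumed by software, the following Python sketch parses a meta-data string into structured records. It assumes the layout suggested by the three cases above: records separated by semicolons, fields separated by commas, an initial-store record of the form date, time, drive, tape, and a copy or migration record of the form date, time, source drive, target drive, tape. The names `PathEntry` and `parse_path_history` are hypothetical, and the sample string uses the January 2003 dates given in the narrative.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PathEntry:
    when: datetime
    source_drive: Optional[str]  # None for the initial store (no source drive)
    target_drive: str
    tape: str


def parse_path_history(meta: str) -> List[PathEntry]:
    """Parse 'date,time,drive,tape;date,time,source,target,tape.' records."""
    entries: List[PathEntry] = []
    for record in meta.strip().rstrip(".").split(";"):
        fields = [f.strip() for f in record.split(",")]
        when = datetime.strptime(f"{fields[0]} {fields[1]}", "%m/%d/%Y %H:%M")
        devices = fields[2:]
        if len(devices) == 2:    # initial store: drive, tape
            source, target, tape = None, devices[0], devices[1]
        elif len(devices) == 3:  # copy or migration: source, target, tape
            source, target, tape = devices
        else:
            raise ValueError(f"unexpected record layout: {record!r}")
        entries.append(PathEntry(when, source, target, tape))
    return entries


# Migration-case record, parsed into two entries.
history = parse_path_history(
    "01/03/2003,14:05,DRIVE1,TAPE1;01/17/2003,10:00,DRIVE1,DRIVE2,TAPE3."
)
for entry in history:
    print(entry.when, entry.source_drive, entry.target_drive, entry.tape)
```

If additional hardware elements such as HBAs or routers were tracked, the record layout would need more fields, and a self-describing format might be preferable to the positional one assumed here.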

[0043] If corrupted data is discovered and a defective device is identified, the 
defective device may be replaced to prevent further data errors. As a hypothetical 
example, on March 1, 2003, a file on TAPE3 is read, and is found to be corrupt. The 
history of the corrupt file is examined, which reveals that a copy of the file was created 
on January 1, 2003, and that the copy made at that time was not corrupted. Further
examination of a hardware log for DRIVE2 reveals that there were bus errors at the time
the file was moved. In response, DRIVE2 is replaced to prevent further corruption of
data. 
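
The log examination in this hypothetical example can be sketched in Python as a filter over device log entries. The `LogEntry` shape, the function `relevant_log_entries`, the 30-minute tolerance, and the sample log contents are all assumptions made for illustration; actual device logs will differ in format and content.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Set


class LogEntry(NamedTuple):
    when: datetime
    device: str
    message: str


def relevant_log_entries(log: List[LogEntry], devices: Set[str],
                         transfer_time: datetime,
                         slack: timedelta = timedelta(minutes=30)) -> List[LogEntry]:
    """Keep entries from devices on the data path that fall near the transfer time."""
    return [
        entry for entry in log
        if entry.device in devices and abs(entry.when - transfer_time) <= slack
    ]


# Made-up log entries for illustration only.
log = [
    LogEntry(datetime(2003, 1, 17, 10, 2), "DRIVE2", "bus error"),
    LogEntry(datetime(2003, 1, 17, 9, 55), "DRIVE4", "cleaning required"),
    LogEntry(datetime(2003, 2, 1, 8, 0), "DRIVE2", "bus error"),
]

# Devices and time taken from the migration record in the path history.
hits = relevant_log_entries(log, {"DRIVE1", "DRIVE2"}, datetime(2003, 1, 17, 10, 0))
for entry in hits:
    print(entry.when, entry.device, entry.message)
```

Only entries from devices that handled the file, logged close to the time of the move, are returned, so unrelated devices and time periods need not be examined.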

III. OTHER EMBODIMENTS 
[0044] While the foregoing disclosure shows a number of illustrative 
embodiments of the invention, it will be apparent to those skilled in the art that various 
changes and modifications can be made herein without departing from the scope of the
invention as defined by the appended claims. Furthermore, although elements of the 
invention may be described or claimed in the singular, the plural is contemplated unless 
limitation to the singular is explicitly stated. 



IBM Docket No. SJO920030042US1 






